ECI 588: Text Mining in Education
Unit 1: Week 2 Code-Along
Adapted from:
Data Science in Education Using R - Chapter 6
https://datascienceineducation.com
Data Science in a Box - Hello World!
https://datasciencebox.com
R Tools
Bring it Together
Arithmetic, Logic, & Assignment
Numbers and basic arithmetic operators (+, -, *, /) behave as expected.
Type this in your console:
2 + 3
You'll get this:
[1] 5
Arithmetic Operators
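Beyond the four basic operators, R also provides exponentiation, modulo, and integer division. A minimal sketch to try in the console; the numbers and variable names are illustrative only:

```r
# Exponentiation: 2 raised to the power of 3
power <- 2^3
# Modulo: the remainder after dividing 7 by 2
remainder <- 7 %% 2
# Integer division: how many whole times 2 goes into 7
whole_part <- 7 %/% 2
# Parentheses control the order of operations
grouped <- (2 + 3) * 4

print(c(power, remainder, whole_part, grouped))
# [1]  8  1  3 20
```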
R can also perform logical operations.
What happens when you type this in your console?
5 > 3
Or this?
5 == 3
What symbol seems to be missing? Or rather, what has replaced it?
Logical Operators
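A short sketch of the most common logical operators. Each expression returns TRUE or FALSE; the variable names are made up for illustration:

```r
is_greater   <- 5 > 3               # greater than
is_equal     <- 5 == 3              # equal to (two equals signs, not one)
is_not_equal <- 5 != 3              # not equal to
both_true    <- (5 > 3) & (2 > 1)   # AND: TRUE only if both sides are TRUE
either_true  <- (5 == 3) | (2 > 1)  # OR: TRUE if at least one side is TRUE
negated      <- !(5 > 3)            # NOT: flips TRUE to FALSE

print(c(is_greater, is_equal, is_not_equal, both_true, either_true, negated))
# [1]  TRUE FALSE  TRUE  TRUE  TRUE FALSE
```

A single = assigns a value or names an argument, which is why testing for equality needs ==.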
Bad Form
my_variable = 2 + 3
my_variable
[1] 5
Good Form
my_dog <- "Howie"
my_dog
[1] "Howie"
The Dollar $ign
Used to select variables.
trees$Height
[1] 70 65 63 72 81 83
Now you try:
Select Girth,
assign to variable tree_girth,
print to console.
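One possible solution to the exercise above, using the built-in trees dataset. Try it yourself before peeking:

```r
# Select the Girth column and assign it to a new object
tree_girth <- trees$Girth

# Typing the object's name prints it to the console
tree_girth
```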
The Question Mark?
Used to get help.
?trees
Everything is an object in R!
We'll be creating a lot of them…
Especially objects of the class data frame
Numeric
my_sum <- 3+3
class(my_sum)
[1] "numeric"
print(my_sum)
[1] 6
Character
my_cat <- "Muffins"
class(my_cat)
[1] "character"
my_cat
[1] "Muffins"
Vector
my_vector <- c(1,2,3,4)
class(my_vector)
[1] "numeric"
print(my_vector)
[1] 1 2 3 4
List
my_list <- c(1,"Kat", 3)
class(my_list)
[1] "character"
my_list
[1] "1" "Kat" "3"
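Note that c() always builds a vector, and a vector can hold only one type, so the numbers above were quietly converted to text (hence the class "character"). A minimal sketch contrasting this with list(), which keeps each element's own type; the object names are hypothetical:

```r
# c() coerces mixed values to a single type
mixed_vector <- c(1, "Kat", 3)
class(mixed_vector)     # "character"

# list() keeps each element as it was entered
true_list <- list(1, "Kat", 3)
class(true_list)        # "list"
class(true_list[[1]])   # "numeric"
class(true_list[[2]])   # "character"
```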
Matrix
my_matrix <- matrix(1:9, nrow = 3, ncol = 3)
print(my_matrix)
[,1] [,2] [,3]
[1,] 1 4 7
[2,] 2 5 8
[3,] 3 6 9
Data Frame
my_df <- trees
head(my_df)
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
6 10.8 83 19.7
Aside from belonging to a certain class, objects can have other attributes, such as column names and dimensions:
colnames(my_df)
[1] "Girth" "Height" "Volume"
dim(my_df)
[1] 31 3
And can even be used with operators! Try this:
my_sum * my_matrix
Functions are pre-packaged pieces of code that (typically) start with a verb, followed by objects or inputs in parentheses called “arguments”:
do_this(to_this)
do_that(to_this, to_that, with_those)
Basic template for creating a function:
name_of_function <- function(argument_1, argument_2){
code_that_does_something
code_that_does_something_else
}
Try making this basic addition function:
add_numbers <- function(number_1, number_2) {
number_1 + number_2
}
Use your new function to add 1 and 2:
add_numbers(1, 2)
[1] 3
What happens if we try to add the objects we created earlier?
add_numbers(my_sum, my_matrix)
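To check your answer, here is a self-contained sketch that recreates the earlier objects and the function, then inspects the result. Because + works element by element, the single number is added to every cell of the matrix:

```r
# Recreate the objects from earlier slides
my_sum <- 3 + 3
my_matrix <- matrix(1:9, nrow = 3, ncol = 3)

# The function from the slide above
add_numbers <- function(number_1, number_2) {
  number_1 + number_2
}

result <- add_numbers(my_sum, my_matrix)
result        # a 3 x 3 matrix: every cell of my_matrix plus 6
dim(result)   # [1] 3 3
```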
Packages are collections of R code that contain functions, data, and/or documentation.
The tidyverse is an opinionated collection of R packages designed for data science; we'll use it a lot.
Installing a Package
# template for installing a package
install.packages("package_name")
# example of installing a package
install.packages("tidytext")
Loading a Package
# template for loading a package
library(package_name)
# example of loading a package
library(tidytext)
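A package only needs to be installed once, but it must be loaded in every new session. One hedged way to avoid reinstalling is to check first whether the package is available. The helper name is_installed() below is hypothetical, and the example checks the stats package because it ships with R:

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise
is_installed <- function(package_name) {
  isTRUE(requireNamespace(package_name, quietly = TRUE))
}

stats_available <- is_installed("stats")
print(stats_available)

# The same pattern for a package from CRAN (not run here):
# if (!is_installed("tidytext")) install.packages("tidytext")
```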
Your home for code, files, reports and more.
Eliminates the need for computer-specific file paths like this:
/Volumes/GoogleDrive/My Drive/College of Ed/Learning Analytics/Courses/ECI 588 Text Mining/R/eci-588/unit-1/img/project.png
And replaces them with this, which anyone can run:
unit-1/img/project.png
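A small sketch of building that relative path in code with base R's file.path(), which joins the pieces with a forward slash. Whether the file actually exists depends on your own project folder:

```r
# Build the relative path piece by piece
img_path <- file.path("unit-1", "img", "project.png")
img_path

# Returns TRUE only if the file is present relative to the working directory
file.exists(img_path)
```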
Live Demo
Copy and paste from DSIEUR:
# install devtools
install.packages("devtools", repos = "http://cran.us.r-project.org")
# install the dataedu package (requires R version 3.6 or higher)
devtools::install_github("data-edu/dataedu")
# Installing the skimr package, not included in {dataedu}
install.packages("skimr")
Let's set up our environment:
library(tidyverse)
library(dataedu)
library(skimr)
library(janitor)
What do you think running the above lines of code accomplished?
How do you know?
What do those conflicts mean?
Just specify the package followed by ::
# using the filter() function from the stats package
x <- 1:100
stats::filter(x, rep(1, 3))
# using the filter() function from the dplyr package
starwars %>%
  dplyr::filter(mass > 85)
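To see why the distinction matters, here is a sketch using only the stats version on a shorter vector. stats::filter() computes a moving calculation over a series, which is a very different job from keeping rows of a data frame:

```r
x <- 1:5

# A moving sum over windows of three values; the two ends have no full window
moving_sum <- stats::filter(x, rep(1, 3))

as.numeric(moving_sum)
# [1] NA  6  9 12 NA
```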
dataedu::ma_data_init
dataedu::ma_data_init -> ma_data
ma_data_init <- dataedu::ma_data_init
names(ma_data_init)
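# Fails: the object name is misspelled, so R cannot find ma_dat_init
glimpse(ma_dat_init)
# Works: corrected object name
glimpse(ma_data_init)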
summary(ma_data_init)
glimpse(ma_data_init$Town)
summary(ma_data_init$Town)
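# Fails: a column name containing a space must be wrapped in backticks
glimpse(ma_data_init$AP_Test Takers)
# Works: backticks around the column name
glimpse(ma_data_init$`AP_Test Takers`)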
summary(ma_data_init$`AP_Test Takers`)
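# Fails: District Name contains a space and is not wrapped in backticks
ma_data_init %>%
  group_by(District Name) %>%
  count()
# Works: backticks around the column name
ma_data_init %>%
  group_by(`District Name`) %>%
  count()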
ma_data_init %>%
  group_by(`District Name`) %>%
  count() %>%
  filter(n > 10)
ma_data_init %>%
  group_by(`District Name`) %>%
  count() %>%
  filter(n > 10) %>%
  arrange(desc(n))
Take the ma_data_init dataset and then
group it by District Name and then
count (the number of schools in a district) and then
filter for Districts with more than 10 schools and then
arrange the list of Districts and the number of schools in each District in descending order, based on the number of schools.
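To check that reading of the pipe by hand, here is a sketch of the same steps on a tiny made-up dataset. The toy_schools data and the threshold of 1 (instead of 10) are assumptions chosen so the counts are easy to verify; this assumes dplyr is installed:

```r
library(dplyr)

# Hypothetical data: three schools in district A, one in B, two in C
toy_schools <- tibble(
  `District Name` = c("A", "A", "A", "B", "C", "C"),
  school = c("s1", "s2", "s3", "s4", "s5", "s6")
)

district_counts <- toy_schools %>%
  group_by(`District Name`) %>%  # group it by District Name and then
  count() %>%                    # count the schools in each district and then
  filter(n > 1) %>%              # keep districts with more than 1 school and then
  arrange(desc(n))               # sort from most schools to fewest

district_counts
# District A (3 schools) comes first, then C (2 schools); B is filtered out
```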
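# Fails: a single = assigns or names an argument; it does not compare values
ma_data_init %>%
  group_by(`District Name`) %>%
  count() %>%
  filter(n = 10)
# Works: == tests for equality
ma_data_init %>%
  group_by(`District Name`) %>%
  count() %>%
  filter(n == 10)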
ma_data_init %>%
  rename(district_name = `District Name`,
         grade = Grade) %>%
  select(district_name, grade)
Why the error for the second and not the others?
ma data <-
ma_data_init %>%
clean_names()
01_ma_data <-
ma_data_init %>%
clean_names()
$_ma_data <-
ma_data_init %>%
clean_names()
ma_data_01 <-
ma_data_init %>%
clean_names()
MA_data_02 <-
ma_data_init %>%
clean_names()
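The assignments above differ only in the object name. One hedged way to check which names R accepts without backticks is to compare each candidate with what base R's make.names() would turn it into; a name that comes back unchanged is already valid. The vector names below are made up for illustration:

```r
candidate_names <- c("ma data", "01_ma_data", "$_ma_data",
                     "ma_data_01", "MA_data_02")

# make.names() repairs invalid names, so an unchanged name was valid already
is_valid_name <- make.names(candidate_names) == candidate_names

data.frame(candidate_names, is_valid_name)
```

Valid names start with a letter (or a dot not followed by a number) and contain only letters, numbers, dots, and underscores. Names are also case sensitive, so MA_data_02 and ma_data_02 would be two different objects.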
Required:
Chapters 5 & 6 of DSIEUR
Recommended:
Swirl
LinkedIn Learning